CTMS: A Comparative Text Mining System

Authors

  • Peng Zang
  • ChengXiang Zhai
Abstract

Many applications need to compare multiple text collections to find commonalities and differences in their topical themes, a task we refer to as comparative text mining. In this paper, we present a general comparative text mining system (CTMS). CTMS takes any two text collections and generates a list of cross-collection themes together with their associated collection-specific themes. The themes are linked to representative passages in each collection. Each theme is represented as a word distribution, and the underlying comparative mining algorithm is based on a probabilistic mixture model. The system carries out all stages of text mining, from data cleaning and preprocessing to the actual mining and post-processing, allowing users to perform comparative analysis between any two collections and navigate through the extracted theme space. The system can potentially be applied to a broad range of areas, including opinion summarization, business intelligence, and text summarization.
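To make the phrase "themes represented as word distributions" under a "probabilistic mixture model" more concrete, the following is a minimal sketch, not the CTMS implementation. It assumes one common form of cross-collection mixture in which a word in a collection is drawn either from a background distribution or from a theme, and a theme word is drawn either from the shared theme or from the collection-specific variant. The function names, the weights `lambda_b` and `lambda_c`, and the toy distributions are all illustrative assumptions, not taken from the paper.

```python
from typing import Dict

WordDist = Dict[str, float]  # a theme: word -> probability, summing to 1


def mixture_word_prob(word: str, background: WordDist, common_theme: WordDist,
                      specific_theme: WordDist, lambda_b: float = 0.5,
                      lambda_c: float = 0.5) -> float:
    """Probability of `word` in one collection under a single-theme mixture.

    lambda_b: assumed weight of the background distribution.
    lambda_c: assumed weight of the shared theme versus the
              collection-specific theme.
    """
    theme_part = (lambda_c * common_theme.get(word, 0.0)
                  + (1.0 - lambda_c) * specific_theme.get(word, 0.0))
    return lambda_b * background.get(word, 0.0) + (1.0 - lambda_b) * theme_part


def common_theme_posterior(word: str, common_theme: WordDist,
                           specific_theme: WordDist,
                           lambda_c: float = 0.5) -> float:
    """Posterior that a theme word came from the shared theme rather than
    from the collection-specific theme (Bayes' rule on the two components)."""
    common = lambda_c * common_theme.get(word, 0.0)
    specific = (1.0 - lambda_c) * specific_theme.get(word, 0.0)
    total = common + specific
    return common / total if total > 0 else 0.0


# Toy distributions, invented purely for illustration.
background = {"the": 0.6, "battery": 0.1, "screen": 0.1, "cheap": 0.1, "heavy": 0.1}
common = {"battery": 0.5, "screen": 0.5}      # theme shared by both collections
specific_a = {"cheap": 0.8, "battery": 0.2}   # variant specific to collection A

vocab = sorted(background)
probs = {w: mixture_word_prob(w, background, common, specific_a) for w in vocab}
```

In an actual mining system the theme distributions would be estimated from the two collections (for example with an EM procedure) rather than written by hand; the sketch only shows how word distributions combine once they are known, and how a posterior like `common_theme_posterior` separates shared from collection-specific vocabulary.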

Similar resources

Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification

The automatic extraction of comparative information is an important text mining problem and an area of increasing interest. In this paper, we study how to build a Korean comparison mining system. Our work is composed of two consecutive tasks: 1) classifying comparative sentences into different types and 2) mining comparative entities and predicates. We perform various experiments to find releva...

Designing a System for Trend Analysis of Users in Website Surfing in Iran Using Data Mining and Text Mining Algorithms

Background and Aim: As web surfing has become part of the lifestyle of a large majority of people, and given the need for more accurate social and cultural policymaking in this field, the authors set out to analyze users' website-viewing behavior in order to help politicians and practitioners. Methods: The design science research method is used in this research...

The Anatomy of a Search and Mining System for Digital Archives

Samtla (Search And Mining Tools with Linguistic Analysis) is a digital humanities system designed in collaboration with historians and linguists to assist them with their research work in quantifying the content of any textual corpora through approximate phrase search and document comparison. The retrieval engine uses a character-based n-gram language model rather than the conventional word-bas...

Ranking of CTD articles and interactions using the OntoGene pipeline

In this paper we briefly describe the architecture of the OntoGene Relation mining pipeline and its application in the task 1 of BioCreative IV. The aim of the task is to deliver information useful for the triage of abstracts relevant to the process of curation of the Comparative Toxicogenomics Database. Although the main focus of our text mining research is the extraction of interactions, we d...

A Model for Extracting Information from Text Documents, Based on Text Mining in the E-Learning Domain

As computer networks become the backbone of science and the economy, enormous quantities of documents become available. Text mining techniques have therefore been used to extract useful information from textual data. Text mining has become an important research area that discovers unknown information, facts, or new hypotheses by automatically extracting information from different written documents. T...

Publication date: 2005